Visualization Type: Interactive PCA Plot

Essential Question

How do Formula 1 constructors’ strategic choices across a racing season relate to their competitive success?

Overview

This project explores the high-dimensional relationship between strategic factors—such as participation, pit stops, and race performance—and final season outcomes by embedding these features into a two-dimensional principal component space and enabling interactive exploration.


Dataset Discussion

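The dataset aggregates detailed Formula 1 constructor-level statistics across multiple seasons. Each record represents a constructor in a particular season and includes:

  • avg_finish: mean finishing position (positionOrder) across the constructor's race entries; lower is better.
  • avg_pit_time: mean pit stop duration in milliseconds, averaged per driver per race and then across the season.
  • podiums: number of top-3 finishes.
  • races: number of race entries (result rows) recorded for the constructor in that season.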

These features offer a nuanced view of a constructor’s engagement, efficiency, and competitiveness during a season.


Data Source and Reproducibility



Preprocessing Steps

  1. Handling Missing Data
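    Missing values were imputed with zero where a zero is meaningful (e.g., zero pit stops for a missed race), so that absence from an event is recorded without distorting performance metrics. In the aggregation code shown in this report, averages are computed with na.rm = TRUE and any constructor-season that still has a missing value is removed with drop_na().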

  2. Feature Scaling
    Principal Component Analysis (PCA) is sensitive to the scale of input features. All numeric features (e.g., average finishing position, podiums, races, pit stops) were standardized to have mean zero and unit variance before fitting the PCA model.

  3. Composite Metrics Construction
    Fine-grained race results were aggregated into season-level summaries (average finish, average pit time, podium count, race count), so that each constructor-season is described by a small set of comparable features. This lets PCA pick up patterns that span several facets of performance rather than single-race noise.

  4. PCA Fitting
    After scaling, the PCA model was fit to the prepared feature set. The resulting principal components were interpreted based on feature loadings to understand the meaning of the embedded dimensions.
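
The steps above can be illustrated on a toy table. This is a minimal sketch in base R, not the project's pipeline: the data frame `toy`, its values, and the `pit_stops` column are invented for illustration. It shows zero imputation, a check that standardization gives mean zero and unit variance, and how the share of variance per component can be read off a fitted `prcomp` object.

```r
# Toy constructor-season table; every value here is invented for illustration.
toy <- data.frame(
  avg_finish = c(3.2, 8.5, 12.1, 15.4, 6.7),
  pit_stops  = c(40, 38, NA, 35, 42),  # NA stands in for a hypothetical missing record
  podiums    = c(20, 6, 0, 0, 9),
  races      = c(21, 21, 18, 21, 21)
)

# Step 1: impute zero where a missing value means "nothing recorded".
toy$pit_stops[is.na(toy$pit_stops)] <- 0

# Step 2: standardize each column to mean 0 and standard deviation 1.
toy_scaled <- scale(toy)
col_means  <- colMeans(toy_scaled)
col_sds    <- apply(toy_scaled, 2, sd)

# Step 4: fit PCA on the standardized matrix.
toy_pca <- prcomp(toy_scaled)

# Proportion of total variance carried by each principal component.
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
print(round(var_explained, 3))
```

Checking `var_explained` for the first two components is one way to judge how much of the structure a two-dimensional plot retains.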


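library(tidyverse)  # dplyr, tidyr, tibble, ggplot2
library(plotly)     # ggplotly()

# Assumes pit_stops, results, races, and constructors are already loaded as data frames.
# Mean pit stop duration (milliseconds) per driver per race
pit_summary <- pit_stops %>%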
  group_by(raceId, driverId) %>%
  summarize(avg_pit_duration = mean(milliseconds, na.rm = TRUE), .groups = "drop")


constructor_data <- results %>%
  left_join(pit_summary, by = c("raceId", "driverId")) %>%
  left_join(races, by = "raceId") %>%
  left_join(constructors, by = "constructorId") %>%
  group_by(year, constructorRef) %>%
  summarize(
    avg_finish = mean(positionOrder, na.rm = TRUE),
    avg_pit_time = mean(avg_pit_duration, na.rm = TRUE),
    podiums = sum(positionOrder <= 3, na.rm = TRUE),
    races = n(),  # counts result rows (race entries), not distinct races
    .groups = "drop"
  ) %>%
  drop_na()  # Remove rows with missing values


constructor_data <- constructor_data %>%
  mutate(label = paste(constructorRef, year))  # constructorRef and year separated by a space

pca_model <- constructor_data %>%
  select(avg_finish, avg_pit_time, podiums, races) %>%
  scale() %>%
  prcomp()


pca_df <- as_tibble(pca_model$x) %>%
  bind_cols(constructor_data)

print(pca_model$rotation)
##                     PC1        PC2         PC3         PC4
## avg_finish    0.6551643  0.2615229  0.03852060  0.70772996
## avg_pit_time -0.2738251  0.6529871 -0.70432559  0.05052837
## podiums      -0.6452645 -0.2941548  0.02868886  0.70447400
## races        -0.2818035  0.6470600  0.70825036 -0.01677982

Interpretation:

PC1 (“Race Success / Performance Metric”)

  • avg_finish (positive, 0.655) and podiums (negative, -0.645) are the biggest drivers.

  • A lower avg_finish (better rank) and higher podiums indicate better performance.

  • So PC1 captures overall race success: better (numerically lower) average finishes and more podiums correspond to lower PC1 scores, which places stronger constructors toward the left of the plot.

PC2 (“Participation and Pit Efficiency Metric”)

  • avg_pit_time (positive, 0.653) and races (positive, 0.647) are the biggest drivers.

  • Long average pit times and a high number of race entries load together, with the same sign.

  • So PC2 captures a pit and participation effect: constructors that raced more and had longer pit times score higher on PC2.
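
To make the sign conventions concrete, the following sketch multiplies the PC1 and PC2 loadings printed above by two hypothetical standardized feature vectors. The two teams and their z-scores are invented for illustration and are not taken from the dataset.

```r
# PC1 and PC2 loadings copied from the rotation matrix printed above.
loadings <- matrix(
  c( 0.6551643,  0.2615229,
    -0.2738251,  0.6529871,
    -0.6452645, -0.2941548,
    -0.2818035,  0.6470600),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("avg_finish", "avg_pit_time", "podiums", "races"),
    c("PC1", "PC2")
  )
)

# Hypothetical z-scores: one constructor that finishes well and scores podiums,
# one that does the opposite. Pit time and race count are held at the mean.
strong_team <- c(avg_finish = -1, avg_pit_time = 0, podiums =  1, races = 0)
weak_team   <- c(avg_finish =  1, avg_pit_time = 0, podiums = -1, races = 0)

# A PCA score is the dot product of the standardized features with the loadings.
scores <- rbind(strong = strong_team, weak = weak_team) %*% loadings
print(round(scores, 3))
```

Under these assumed inputs the strong constructor gets a PC1 score of about -1.30 and the weak one about +1.30, consistent with reading lower PC1 as better race results.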

p <- ggplot(pca_df, aes(x = PC1, y = PC2)) +
  geom_point(
    aes(
      size  = podiums,
      color = avg_finish,
      text  = paste0(
        "Constructor: ", constructorRef, "<br>",
        "Season:      ", year,        "<br>",
        "Avg Finish:  ", round(avg_finish,2), "<br>",
        "Podiums:     ", podiums
      )
    ),
    alpha = 0.8
  ) +
  scale_color_viridis_c(direction = -1, option = "plasma") +
  scale_size_continuous(name = "Podiums", range = c(1, 10)) +
  theme_minimal() +
  labs(
    title    = "PCA of Constructor Strategy by Season",
    subtitle = "Color = Avg Finish (lower = better), Size = Podiums",
    x        = "PC1: Composite Race Metrics",
    y        = "PC2: Pit & Participation Effects",
    color    = "Avg Finish",
    size     = "Podiums"
  ) +
  theme(legend.position = "right")

ggplotly(p, tooltip = "text") %>%
  layout(showlegend = TRUE)

Design Choices

  1. Projection Technique
    PCA was selected to summarize high-dimensional constructor performance and strategy into a two-dimensional, human-interpretable space.

  2. Point Encoding

    • Color: Constructors are colored by average finishing position using the perceptually uniform "plasma" palette from the viridis family, with the direction reversed (direction = -1). Lighter shades therefore represent stronger performance (lower average finish).
    • Size: Point size reflects the number of podium finishes, signaling constructors with more frequent top-3 finishes.
  3. Interactivity

    • Hover tooltips provide detailed constructor-level information (name, season, average finish, podiums) without cluttering the static plot.
    • Zoom and pan functionality enable effective exploration of clusters and outliers.
  4. Axis Interpretation
    PCA loadings were used to interpret the axes:

    • PC1 (horizontal axis): Captures race performance and podium success.
    • PC2 (vertical axis): Reflects pit strategy and participation patterns.
  5. Design Trade-offs

    • Constructor labels were surfaced via interactivity (hover) rather than static placement to avoid overwhelming the visual space.
    • A perceptually uniform color palette was prioritized for accessibility.
    • A minimalistic theme was selected to focus attention on the data points.
  6. Reproducibility
    All design elements were generated programmatically from the data, ensuring consistency and reproducibility.


Key Findings


Insight

These findings challenge the assumption that mere participation or pit stop optimization leads to better outcomes. Instead, actual competitive race performance remains the strongest driver of success.